The topic of my final project is the ongoing Coronavirus pandemic in California. This project is very topical because of the impact the virus is currently having on the world and on California. By examining a dataset from the Los Angeles Times, I hope to find meaningful insights about the effects of the virus across different areas of California, institutions, and times of year. The questions that I am asking are as follows.
The data source that I am using is The Los Angeles Times’ open-source archive of California Coronavirus data. The landing page is on github and the link is here. State and county totals come from the California Department of Public Health. Data on hospitalizations, tests, demographics and reopening plans also come from the state health department. The nursing home totals, including skilled-nursing facilities and assisted-living facilities, are monitored by the California Department of Social Services. The numbers are gathered and posted each day.
There should not be any ethical issues related to how the data was gathered. The LA Times Coronavirus tracker details that almost all of the data comes from the California Department of Public Health. The figures for nursing homes are monitored by the California Department of Social Services. The data does not seem to have any ethical issues with its collection because it was carried out by these two government entities. It is highly likely that these entities include all of California in their data and do not systematically exclude any groups. The data figures are also not scraped from anywhere, which could cause potential legal or moral complications, but are taken directly from the hospitals.
No ethical issues should arise from the analysis. The analysis mostly focuses on the state as a whole, and when it is on a more local level, by county. The analysis looks mainly for trends of how the cases and deaths in the state have evolved over time and as a result of vaccinations. The only issue that could possibly arise is that the county level analysis could have bias based on the demographics of people that live there. However, the main purpose of a county level is to examine geographical impact, not the social one.
The dataset that I am using is already public on github and contains aggregate totals for cases and deaths. This is great because the data contains no individual data, which could possibly be used to identify specific people.
The libraries used for this report are tidyverse, modelr, rjson, and plotly. Tidyverse is mainly used because of the read_csv() function, tibbles, and pipes. modelr is very useful because of the modelling capabilities that the library brings to the data. Plotly and rjson are used for their data visualization power, which includes a mapbox choropleth map that I will create in this report.
library(tidyverse)
library(modelr)
library(rjson)
library(plotly)
The data for this project comes from csvs posted on the LA Times’ Coronavirus github repository. There are three sets of data imports for each of the three questions that we are trying to answer. The only function that is used is read_csv(), which takes the path to the csv (or link to a csv in this case) and returns the csv as a tibble.
state_vax <- read_csv("https://raw.githubusercontent.com/datadesk/california-coronavirus-data/master/cdph-vaccination-state-totals.csv")
state_hospital <- read_csv("https://raw.githubusercontent.com/datadesk/california-coronavirus-data/master/cdph-hospital-patient-state-totals.csv")
state_deaths <- read_csv("https://raw.githubusercontent.com/datadesk/california-coronavirus-data/master/cdph-state-cases-deaths.csv")
nursing_totals <- read_csv("https://raw.githubusercontent.com/datadesk/california-coronavirus-data/master/cdph-skilled-nursing-totals.csv")
The data that is read into R contains more columns than are needed for the question we are trying to answer. In the first line, we use select() to choose the 5 columns that we will use for our analysis. The state_vax table with 5 columns is stored as state_vax1. The state_vax table has fewer entries than the state_hospital table, so it is necessary to use filter() to start both tables’ “date” columns on the same date. The filtered state_hospital table is stored as state_hospital1. The third line left joins state_vax1 to state_hospital1 by the “date” column and stores the resulting tibble as joined_1. The fourth line performs multiple transformations to joined_1. First, it sorts the table by the date column, creates a new variable that is the change in new patients that are positive with COVID, and selects only the 5 columns that are needed for this analysis.
state_vax1 <- state_vax %>% select("date", "doses_administered", "new_doses_administered", "at_least_one_dose_percent", "fully_vaccinated_percent")
#Chose 2020-07-27 because that is the first date in state_vax.
state_hospital1 <- state_hospital %>% filter(date >= "2020-07-27")
joined_1 <- state_hospital1 %>% left_join(state_vax1, by = "date")
joined_2 <- joined_1 %>% arrange(date) %>% mutate(new_patients = positive_patients - lag(positive_patients)) %>% select(date, positive_patients, new_patients, new_doses_administered,fully_vaccinated_percent)
The below code chunk creates a season indicator variable. This is done by mainly using logic to assign the indicator if the date is less than the nearest solstice or equinox.
joined_2$season <- ifelse(joined_2$date <= as.Date("2020-09-22", "%Y-%m-%d"), "Summer", ifelse(joined_2$date <= as.Date("2020-12-21", "%Y-%m-%d"), "Autumn", ifelse(joined_2$date <= as.Date("2021-03-21", "%Y-%m-%d"), "Winter", ifelse(joined_2$date <= as.Date("2021-06-21", "%Y-%m-%d"), "Spring", ifelse(joined_2$date <= as.Date("2021-09-22", "%Y-%m-%d"), "Summer", ifelse(joined_2$date <= as.Date("2021-12-21", "%Y-%m-%d"), "Autumn", 0))))))
joined_2
## # A tibble: 504 x 6
## date positive_patients new_patients new_doses_admini~ fully_vaccinated~
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-07-27 6896 NA 3 0
## 2 2020-07-28 6939 43 4 0.0000000255
## 3 2020-07-29 6753 -186 8 0.0000000255
## 4 2020-07-30 6630 -123 11 0.0000000511
## 5 2020-07-31 6477 -153 6 0.0000000511
## 6 2020-08-01 6362 -115 11 0.0000000766
## 7 2020-08-02 6399 37 3 0.000000102
## 8 2020-08-03 6302 -97 12 0.000000128
## 9 2020-08-04 6184 -118 17 0.000000128
## 10 2020-08-05 6069 -115 17 0.000000128
## # ... with 494 more rows, and 1 more variable: season <chr>
The first line of code makes a tibble called nurse_join. nurse_join is the data in nursing_totals, but the dates are filtered to be before date_var. Also, only 5 columns are selected and a new variable called deaths_total is created. The second line of code makes state_join, which is the data in state_deaths filtered to be the dates before date_var and only the columns: date, confirmed_case, and reported_deaths. The last line left joins state_join to nurse_join by the date column.
nurse_join <- nursing_totals %>% filter(date <= date_var) %>% select(date, patients_confirmed_cases, staff_confirmed_cases, staff_deaths, patients_deaths) %>% mutate(deaths_total = staff_deaths + patients_deaths)
state_join <- state_deaths %>% filter(date <= date_var) %>% select(date, confirmed_cases, reported_deaths)
nurses_joined <- nurse_join %>% left_join(state_join, by = "date")
nurses_joined
## # A tibble: 554 x 8
## date patients_confirmed~ staff_confirmed_~ staff_deaths patients_deaths
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2021-12-08 65583 55627 252 9367
## 2 2021-12-07 65565 55603 252 9363
## 3 2021-12-06 65541 55579 251 9362
## 4 2021-12-05 65529 55569 251 9361
## 5 2021-12-04 65515 55556 251 9361
## 6 2021-12-03 65490 55534 251 9361
## 7 2021-12-02 65449 55508 251 9355
## 8 2021-12-01 65430 55476 251 9355
## 9 2021-11-30 65395 55451 251 9353
## 10 2021-11-29 65378 55429 251 9351
## # ... with 544 more rows, and 3 more variables: deaths_total <dbl>,
## # confirmed_cases <dbl>, reported_deaths <dbl>
The first relationship we want to examine is between hospitalizations in California and vaccinations in California. To start with, we will use the daily change in new doses of vaccines administered and the daily change in new hospitalizations. The first model will be a linear regression model of the effect of new_doses_administered on new_patients. R makes this easy because the built-in lm() function performs an ordinary least squares regression.
joined_2_mod <- lm(new_patients ~ new_doses_administered , data = joined_2)
joined_2 %>% ggplot(aes(x = new_doses_administered, y = new_patients)) + geom_point() + geom_abline(aes(intercept = joined_2_mod$coefficients[1], slope = joined_2_mod$coefficients[2]), color = "red", lwd = 1.3)
joined_2_mod
##
## Call:
## lm(formula = new_patients ~ new_doses_administered, data = joined_2)
##
## Coefficients:
## (Intercept) new_doses_administered
## 48.0974568 -0.0004581
Before interpreting the results, let us first check if the model works.
We will need to check:
1. If the average of the residuals equals zero
2. That the residuals are spread out
3. The residuals are independent of each other (there is no clear pattern)
If our model is working as intended, the average of the residuals should be zero. It is comforting to see that the average_resid is approximately equal to 0. The graph also shows spread out residuals. The residual point plot’s points are also not displaying a clear pattern, which makes it seem like all the residuals are independent of each other.
joined_2 <- joined_2 %>% add_residuals(joined_2_mod)
joined_2 %>% summarize(average_resid = mean(resid, na.rm = TRUE))
## # A tibble: 1 x 1
## average_resid
## <dbl>
## 1 -5.53e-15
joined_2 %>% ggplot() + geom_density(aes(resid))
joined_2 %>% ggplot(aes(new_doses_administered, resid)) + geom_point()
The equation of the line is: \[newPatients = -0.0004581newDoses + 48.09746\]
This displays a negative relationship between the number of new doses and the number of new positive patients. This means that the more new doses of COVID vaccines that are administered, the less new patients there are in the hospitals.
Let us include another variable of interest into this equation. It is possible that adding the season of the year has an effect on the number of new patients in the hospitals. A season indicator could provide interesting results because date is not a factor until this point.
The below graph is the same point plot as before, but the color of the points corresponds to the season indicator.
joined_2 %>% ggplot(aes(x = new_doses_administered, y = new_patients, color = season)) + geom_point()
The following code chunk creates two models, one without interaction between new_doses_administered and season, and one with interaction between the two. Then, predictions for the two models are added to a data grid. A preview of the grid is printed below. It includes the model used to generate the predictions and the unique values of new_doses_administered and season together.
mod1 <- lm(new_patients ~ new_doses_administered + season, data = joined_2)
mod2 <- lm(new_patients ~ new_doses_administered * season, data = joined_2)
grid <- joined_2 %>% data_grid(new_doses_administered, season) %>% gather_predictions(mod1, mod2)
head(grid)
## # A tibble: 6 x 4
## model new_doses_administered season pred
## <chr> <dbl> <chr> <dbl>
## 1 mod1 0 Autumn 104.
## 2 mod1 0 Spring 93.5
## 3 mod1 0 Summer 23.5
## 4 mod1 0 Winter -94.5
## 5 mod1 1 Autumn 104.
## 6 mod1 1 Spring 93.5
Below is a useful visualization of the two models that can help decide which model is better fitted. In the first model, it can be seen that not allowing for interaction keeps the slopes constant for each season. When there is interaction between new_doses_administered and season, the slope changes for each season. The second model seems to be better for this analysis because the slope is allowed to change dramatically. The slope of summer even became steeply positive!
ggplot(data = joined_2, aes(new_doses_administered, new_patients, color = season)) +
geom_point() +
geom_line(data = grid, aes(y = pred)) +
facet_wrap(~ model)
Before interpreting the better model’s coefficients (the second model) let us double check that the residuals support mod2 as the better model. The residuals in mod2 seem to have a more independent relationship than the residuals in mod1. This supports mod2 as the better model.
joined_2_1 <- joined_2 %>% gather_residuals(mod1, mod2)
ggplot(joined_2_1, aes(new_doses_administered, resid, col = season)) +
geom_point() +
facet_grid(model ~ season)
The coefficients from the model with interaction are viewable below. The sign of the predicted models are the most significant and interpretable. Summer has a positive slope, which displays that in the summer, even when the new_doses_administered increased, the new_patients admitted to the hospital for COVID increased. The other three seasons all have negative coefficients, which makes summer the most infectious season. This observation is contrary to what many may assume.
ggplot(joined_2) + geom_line(aes(x=date,y=new_patients))+
scale_x_date(date_breaks = "months" , date_labels = "%b-%y") + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) +
geom_vline(xintercept = as.Date("2020-09-22"),linetype=4, colour= "orangered2") +
geom_vline(xintercept = as.Date("2020-12-22"),linetype=4, colour="dodgerblue2") +
geom_vline(xintercept = as.Date("2021-03-21"),linetype=4, colour="yellow3") +
geom_vline(xintercept = as.Date("2021-06-21"),linetype=4, colour="red") +
geom_vline(xintercept = as.Date("2021-09-22"),linetype=4, colour="orangered2")
mod2
##
## Call:
## lm(formula = new_patients ~ new_doses_administered * season,
## data = joined_2)
##
## Coefficients:
## (Intercept) new_doses_administered
## 1.322e+02 -8.155e-04
## seasonSpring seasonSummer
## -1.344e+02 -1.874e+02
## seasonWinter new_doses_administered:seasonSpring
## -1.486e+02 7.569e-04
## new_doses_administered:seasonSummer new_doses_administered:seasonWinter
## 2.069e-03 -2.652e-05
For examining if the nursing homes fared better than the rest of the state we will look at the trends in the nursing and general case and death lines. To plot the nursing home cases and the total state cases on the same scale a second y-axis was attached to the ggplot graph. The y scale on the left side of both plots pertains to nursing home cases and deaths, while the right side pertains to the total state cases. For the cases graph, the nursing home cases are around 60,000 and the total state are around 6,000,000. Even though the units are of different magnitude, the trends in relation to each other are still interpretable.
A difference in the cases and deaths graphs between the state and nursing homes is the uptick around September 2021. At around September 2021 there is a sizable visible increase in the slope of the cases and deaths curves. The patient cases and staff cases curves are practically identical at that position and they both increase less than the state overall curves. While this is an example of a difference in the curves it is also important to point out that both sets of curves, for both cases and deaths, rapidly increased together at around December 2020. Then they both stayed relatively flat until the spike in general California cases in September 2021.
It appears that when both the nursing cases and deaths and the general population cases and deaths spiked, the nursing homes and the general population fared equally well (or unwell) against COVID deaths and infections. Then, during the recent spike in the general population’s infections and deaths in around September 2021, the nursing homes experienced a less extreme increase in cases and deaths.
The below code chunk plots the nursing homes cases and deaths against the California state cases and deaths. It uses a secondary axis which multiplies the units label by the coefficient. This is because the state units are divided by the coefficient so the state trend line can be close to the nursing trend line on the graph.
coeff <- 100
ggplot(nurses_joined, aes(x = date)) + geom_line(aes(y = patients_confirmed_cases, color = "Patient Cases")) + geom_line(aes(y = staff_confirmed_cases, color = 'Staff Cases')) + geom_line(aes(y = confirmed_cases/coeff, color = "Confirmed State")) +
scale_y_continuous(
# Features of the first axis
name = "Nursing Home Cases",
# Add a second axis and specify its features
sec.axis = sec_axis(~.*coeff, name="Total State Cases")
) +
labs(x = "Year",
y = "(%)",
color = "Legend")
coeff <- 10
ggplot(nurses_joined, aes(x = date)) + geom_line(aes(y = deaths_total, color = "Patient and Staff Deaths")) + geom_line(aes(y = reported_deaths/coeff, color = 'State Deaths')) +
scale_y_continuous(
# Features of the first axis
name = "Nursing Home Deaths",
# Add a second axis and specify its features
sec.axis = sec_axis(~.*coeff, name="Total State Deaths")
) +
labs(x = "Year",
y = "(%)",
color = "Legend")
It seems that new vaccinations in California and new patients hospitalized with COVID-19 have a negative relationship. This means that the more new vaccinations there are in California, the fewer new patients hospitalized for COVID-19 in California. This is evident in the least squares regression of new_patients and new_doses_administered having a negative slope. This analysis is very limited in scope, there are many other secondary variables that explain part of the movement in new_patients. To try and gain more complete insight into the movement of new_patients, a season indicator was added to the analysis. This was under the hypothesis that there may be some seasonal influence in the number of new hospital cases. The inclusion of a seasonal indicator did provide more evidence, and the model that includes interaction between season and new_doses_administered proved to be a better model than no interaction. The mod2 model displays a negative relationship between new_doses_administered and new_patients for all seasons except summer (Autumn, Winter, and Spring). While this conclusion is very interesting because many forecasted winter as being the worst season for COVID, a more comprehensive investigation is needed before the results can be fully accepted. It is possible that there is a confounding variable that changes our results dramatically. However, the hypothesis that season and new_doses_administered have an effect on the new hospital patients would be a great next investigative step.
Compared to the rest of the state it seems like the nursing homes had a similar trend in the beginning of the pandemic, but the more recent spike in around September 2021 does not effect the nursing homes as much as the rest of the state. It is possible to interpret from these conclusions that nursing homes have been able to adapt since the earlier times of the virus to better protect their staff and patients. It is possible that the spike was not as deadly in the nursing homes because of vaccines in the nursing homes and stricter efforts to keep the virus out than the general state.